
e1071::svm(): Use formula interface only if factors are present #1740

Merged
merged 10 commits into master from fix_1738_svm_no_formula on Jun 17, 2019

Conversation

mb706
Contributor

@mb706 mb706 commented Mar 28, 2017

See if this fixes #1738

@mb706
Contributor Author

mb706 commented Mar 28, 2017

Note 1: You guys could systematically go through the learners and look for those that use the formula interface but could conceivably use the superior data.frame interface. Anecdotally, I have seen the shim that is e1071:::svm.formula many times before (it just extracts the data.frame from the formula and then uses that). EDIT: This turns out to be a bit more complicated, since the formula interface is used for dummy-encoding of factor features. It could still be done with some work. Things to look out for are parameters that somehow refer to columns (e.g. weighting or scaling of different features) and would need to be padded to the new number of columns, and whether or not an intercept column is added.

Note 2: I don't really like the test I implemented, it takes a relatively large amount of time and memory while really only testing for a design decision. Feel free to leave it out.

Note 3: If you think the test for many-feature data.frames is a good idea, it would be possible to systematically apply that test to all the learners that are supposed to handle many-feature situations.
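For illustration, the approach this PR converges on ("use the formula interface only if factors are present") could be sketched roughly like this. train_svm is a hypothetical wrapper for illustration only, not actual mlr code:

```r
library(e1071)

# Hypothetical wrapper illustrating the idea behind this PR: only go
# through the formula interface (which dummy-encodes factors via
# model.matrix) when factor features are actually present; otherwise
# pass the numeric matrix directly and skip the expensive shim.
train_svm <- function(data, target) {
  x <- data[, setdiff(names(data), target), drop = FALSE]
  if (any(vapply(x, is.factor, logical(1L)))) {
    # factors present: the formula interface handles dummy encoding
    svm(as.formula(paste(target, "~ .")), data = data)
  } else {
    # all-numeric features: the x/y interface avoids the formula
    # machinery and its memory blow-up on wide data
    svm(x = as.matrix(x), y = data[[target]])
  }
}
```

With many features and no factors, skipping the formula path avoids building terms objects and a model matrix, which is exactly the stack-overflow scenario from #1738.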

@mb706
Contributor Author

mb706 commented Mar 28, 2017

This turns out to be harder than I thought; see my comment in the issue.

@berndbischl
Sponsor Member

i would not advise merging this. see what i also posted in the issue. we should create a clean issue out of this.

@mllg
Sponsor Member

mllg commented Apr 3, 2017

The latest changes (only using the formula interface if factors are present) look good to me.

@larskotthoff
Sponsor Member

Yep, looks good to me.

@berndbischl
Sponsor Member

Yep, looks good to me.

how about commenting on the general issue i raised? this is more than "hey code looks good here".
i mean you can argue that we can implement this only for this learner as this is a clear improvement, but then please do so. and that we should possibly open up a more general issue for the general problem and not block this improvement.

also, does anybody see any PROBLEM that this might introduce? or is it just a clear enhancement?

@berndbischl
Sponsor Member

i mean, just note the possible side effects this has with PR #1763

what happens if we merge that too?
now the formula is taken ONLY if the task ALSO contains factors? pretty weird semantics, right?
or stated otherwise: does the svm then support the (now not existing) "formula" property...?

@larskotthoff
Sponsor Member

As I've said in the other thread, an alternative would be to have two separate learners for this. That would avoid issues with being able to specify formulas.

@berndbischl
Sponsor Member

As I've said in the other thread,

where exactly please?

@larskotthoff
Sponsor Member

#1738 (comment)

@berndbischl
Sponsor Member

#1738 (comment)

i am not a big fan of this. it is
a) confusing for users and overly complex to understand
b) it copies code A LOT.
c) the svm mlr learner should really do this internally

@giuseppec
Contributor

giuseppec commented Apr 5, 2017

I have to agree with bernd. We should not try to autodetect this and internally do an if-else here that changes the behaviour. I thought that our convention was to always use the formula interface whenever possible (e.g. randomForest). I think the user should be able to choose this himself.
Have to think about this a bit longer, but one possible solution could be to introduce a formula (and a data.frame) property and let the user decide what to use (e.g. via configureMlr, or as an option of the learner, or by checking if the task contains a formula and always preferring the formula interface if the learner supports it, otherwise using the data interface).

@giuseppec
Contributor

giuseppec commented Apr 6, 2017

What cases do we have?

  • the learner supports both, using a formula and data.frame (or matrix)
  • the learner supports only using a formula
  • the learner supports only using a data.frame (or matrix)
  • ???

Concrete suggestion:

  • add a new option prefer.formula.interface to configureMlr
  • default should be prefer.formula.interface = TRUE which always prefers the formula interface whenever possible (i.e. when the learner has a formula property which we have to add)
  • learners that allow using a formula should have a formula property

Consequences:

  • this way, we don't change the current behaviour and it should always be clear what mlr is doing (currently this is not even clear for users because it is not documented which learner uses which interface)
  • a user will be able to change the interface (and speed up) learners that also support using data.frames (or matrices)
  • contra: our tests will take longer as we have to test both options.
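The suggestion above could be dispatched roughly like this. This is a sketch only: prefer.formula.interface and the "formula" learner property are the proposal, not existing mlr API, and chooseInterface is a hypothetical helper name:

```r
# Sketch of the proposed dispatch. None of these names are real mlr
# API; they only illustrate the suggested configureMlr option and the
# proposed "formula" learner property.
chooseInterface <- function(learner, prefer.formula.interface = TRUE) {
  if (prefer.formula.interface && "formula" %in% learner$properties) {
    "formula"      # learner advertises formula support and the user prefers it
  } else {
    "data.frame"   # fall back to the plain data interface
  }
}
```

Keeping the default at prefer.formula.interface = TRUE matches the stated goal of not changing current behaviour while still letting users opt into the faster data interface.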

@mb706
Contributor Author

mb706 commented Apr 6, 2017

My opinion: I suggest not complicating the UI but instead making getTaskData also conform to the formula if one is given (see my comment on the PR).

Different solution: Use the formula interface if a user-specified formula was attached to the task (this would add another condition to the if clause), and use the data.frame directly otherwise.
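That second suggestion would amount to something like the following sketch (hypothetical: task$formula is illustrative and does not exist as an mlr task field; getTaskData with target.extra = TRUE is real mlr API):

```r
library(mlr)
library(e1071)

# Hypothetical: prefer the formula interface only when the user
# explicitly attached a formula to the task; otherwise pass the
# data.frame/matrix directly.
if (!is.null(task$formula)) {
  model <- e1071::svm(task$formula, data = getTaskData(task))
} else {
  # target.extra = TRUE returns list(data = features, target = y)
  d <- getTaskData(task, target.extra = TRUE)
  model <- e1071::svm(x = as.matrix(d$data), y = d$target)
}
```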

@larskotthoff
Sponsor Member

What's the status here? Have we decided on a course of action?

@giuseppec
Contributor

I think we haven't made a concrete decision yet. Do we even need this PR here which only solves this issue for SVM? Or do we want to solve this on an abstract level?
I still have this incomplete PR #1763 with some concrete TODOs where we could also try to address the issue mentioned here.

@pat-s
Member

pat-s commented Aug 2, 2017

pushing this discussion as it sets the base for some PRs now (#1899, #1763, #1740)

@giuseppec is again working on #1763, specifying the formula within the task.
Is this agreed on? I think it is not that practical, because then the task can only be used with models supporting this particular formula notation (correct me if I'm wrong?)

There must always be one place where the formula is specified manually. Why is the ParamSet of a learner (as in #1899) a bad place?

@giuseppec
Contributor

giuseppec commented Aug 2, 2017

  1. Yes, I started working on this again. But I am currently just playing around, still not finished ;-).
  2. No, we don't have an agreement yet. What I want to do: everything that currently works should of course still work, and I would like to add the possibility to specify the formula in the task (just optional). But I agree with you that this can then only be used for learners that support formula objects. However, as @mb706 suggested, we could extend getTaskData so that it uses only the features mentioned in the formula.
  3. I don't think that adding a formula to the learner-object is bad in general (I rather think that maybe both should be possible, specifying the formula in task AND in learner objects).
    Suppose you want to apply a learner to 2 or more different tasks (e.g. using the benchmark function). Setting the formula in the learner will only work for one task (as you need to add the feature names).
  4. I think it is very important that we do this stepwise (maybe multiple smaller PRs). A big PR has a high chance that it will never be merged.

@pat-s
Member

pat-s commented Aug 3, 2017

But I agree with you that this can then only be used for learners that support formula objects.

This would be ok if the same formula notation were accepted by multiple learners. However, in the case of mgcv::gam the s() notation is unique, I think (at least I do not know of any other smooth notation). So if we implemented it this way, we would need to create an extra task only for mgcv::gam.

@berndbischl
what is the main issue that a formula should not be in the paramset for specific learners?

@pat-s pat-s changed the title Fix 1738 svm no formula e1071::svm(): Use formula interface only if factors are present Jun 6, 2019
@pat-s
Member

pat-s commented Jun 6, 2019

Since the discussion about formula handling was ported to mlr3 and will probably not be solved here anymore (which blocked the merging of this PR), is there anything that would still speak against merging this SVM fix?

@pat-s
Member

pat-s commented Jun 17, 2019

@larskotthoff ready to merge?

@larskotthoff larskotthoff merged commit e6f045e into master Jun 17, 2019
@larskotthoff larskotthoff deleted the fix_1738_svm_no_formula branch June 17, 2019 17:39
vrodriguezf pushed a commit to vrodriguezf/mlr that referenced this pull request Jan 16, 2021
…org#1740)

* Testing svm with many features task

* svm use data.frame instead of formula

* spaces around match operator

* Only use svm data.frame interface if task is all numeric

* Deploy from Travis build 13884 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/542175846
Commit: 5565287

* add NEWS entry

* Deploy from Travis build 13922 [ci skip]

Build URL: https://travis-ci.org/mlr-org/mlr/builds/546742364
Commit: de67d1a

Successfully merging this pull request may close these issues.

Stack overflow for a not so big task
7 participants